Demonstation notebook

Query all the geographic location from EBI Influenza A samples and draw them on a map


In [1]:
import urllib
import pandas as pd

Query, how many flu samples are there?

Build the url for the advanced search

In [2]:
url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?' #base for advanced search
url_query='\"tax_tree(11320)\"' #influenza A taxon and all subordinates (tree)
url_result='&result=sample' # looking for samples, they have location
url_count='&resultcount' # count the results
In [3]:
url=url_base+url_query+url_result+url_count

Query the url

In [4]:
print urllib.urlopen(url).read()
Number of results: 1,095,255
Time taken: 0 seconds

Download geoloco from all samples

Build url

In [5]:
url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?'
url_query='\"tax_tree(11320)\"'
url_result='&result=sample'
url_display='&display=report' #report is the tab separated output
url_fields='&fields=accession,location' #get accesion and location
url_limits='&offset=1&length=1095067' #get all the results
In [6]:
url=url_base+url_query+url_result+url_display+url_fields+url_limits

Download the table to a string

In [7]:
ena_flu_loco_page = urllib.urlopen(url).read()

Load the table into a pandas DataFrame

In [8]:
from StringIO import StringIO
ena_flu_loco_table = pd.read_csv(StringIO(ena_flu_loco_page),sep='\t')

Peek into the table

In [9]:
ena_flu_loco_table.head()
Out[9]:
accession location
0 33124 NaN
1 9544 NaN
2 9545 NaN
3 SAMD00000344 NaN
4 SAMD00000345 NaN

See how many has geolocation

In [10]:
print "The number of sample with geolocations is: ",
print len(ena_flu_loco_table[ pd.isnull(ena_flu_loco_table['location']) == False ])
The number of sample with geolocations is:  125477
In [11]:
ena_loco=ena_flu_loco_table[ pd.isnull(ena_flu_loco_table['location']) == False ]
In [12]:
ena_loco.head()
Out[12]:
accession location
2856 SAMD00003553 44.337 N 143.38083 E
2857 SAMD00003554 44.337 N 143.38083 E
2858 SAMD00003555 44.337 N 143.38083 E
2859 SAMD00003556 44.337 N 143.38083 E
2860 SAMD00003557 44.337 N 143.38083 E

Some location are malformed

In [14]:
err= ena_loco[ [ len(x.split(' '))!=4 for x in ena_loco['location'] ]]
err.head()
Out[14]:
accession location
196956 SAMEA2384944 18.49041 E
196980 SAMEA2384968 18.49041 E
202130 SAMEA2392363 23.1 E
234588 SAMEA2547925 28.23 E
234589 SAMEA2547926 28.23 E

Delete these

In [15]:
ena_loco = ena_loco[ [ len(x.split(' '))==4 for x in ena_loco['location'] ]]

Parse the loniitudes, longitudes

In [26]:
def parse_lat(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[1] =='N'):
        return float(loc_list[0])
    elif (loc_list[1] =='S'):
        return -float(loc_list[0])
    
def parse_lon(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[3] =='E'):
        return float(loc_list[2])
    elif (loc_list[3] =='W'):
        return -float(loc_list[2])
ena_loco['lat']=map(parse_lat,ena_loco['location'])
ena_loco['lon']=map(parse_lon,ena_loco['location'])
In [27]:
ena_loco.head()
Out[27]:
accession location lat lon
2856 SAMD00003553 44.337 N 143.38083 E 44.337 143.38083
2857 SAMD00003554 44.337 N 143.38083 E 44.337 143.38083
2858 SAMD00003555 44.337 N 143.38083 E 44.337 143.38083
2859 SAMD00003556 44.337 N 143.38083 E 44.337 143.38083
2860 SAMD00003557 44.337 N 143.38083 E 44.337 143.38083

See how many unique locations we have

In [30]:
uniq_locs=ena_loco.groupby(['lat','lon']).size().reset_index()
uniq_locs.columns=['lat','lon','count']
print 'Number of unique locations:',
print len(uniq_locs.sort('count', ascending=False))
Number of unique locations: 16142

Generate popup string for each unique location

  • accesion, an
  • long list shorted because it breaks the map
  • anything could be eritten
In [63]:
def form_acc(x):
    if (x['accession'].size < 5):
        return pd.Series(
            dict(count = x['accession'].size, acc_list = ' '.join(x['accession'])))
    else:
        return pd.Series(
            dict(count = x['accession'].size, acc_list = ' '.join(list(
                        x['accession'])[:2]) + ' ... ' + ' '.join(list(
                        x['accession'])[-2:])))

uniq_locs_w_acc=ena_loco.groupby(['lat','lon']).apply(form_acc).reset_index()

Plot points on map

  • Point size proportional to number of cases
    • miseleading, beacuse somewhere all the cases around have the sample position (Europe), and somewhere the positions are more scattered (Shanghai)
In [65]:
from IPython.core.display import HTML
import folium

def inline_map(m, width=650, height=500):
    """Takes a folium instance and embed HTML."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{}" '
                 'style="width: {}px; height: {}px; '
                 'border: none"></iframe>'.format(srcdoc, width, height))
    return embed


width, height = 650, 500
flu_map = folium.Map(location=[47, -17], zoom_start=3,
                    tiles='OpenStreetMap', width=width, height=height)

for i in xrange(len(uniq_locs_w_acc)):
    loc=(uniq_locs_w_acc.iloc[i]['lat'],uniq_locs_w_acc.iloc[i]['lon'] )
    name=uniq_locs_w_acc.iloc[i]['acc_list']
    size=uniq_locs_w_acc.iloc[i]['count']
    flu_map.circle_marker(location=loc, radius=100*size,
                          line_color='none',fill_color='#3186cc',
                          fill_opacity=0.7, popup=name)

inline_map(flu_map)
Out[65]:

Conclusions:

  • Some bad geolocations line at 0 longitude
  • With 16k points it uses ~500Mb in my chrome
  • Very slow on a Bay-Trail intel proc
  • Folium has limited drawing capabilities